While looking through Kaggle's open datasets for a regression dataset suitable for a lecture, I came across this one. Link
The dataset contains sale prices of houses in King County sold between May 2014 and May 2015.
What I want to convey with this data is the power of feature engineering and data exploration: how far the performance of a predictive model can be pushed simply by repeating a few straightforward steps.
Since this is material prepared for a lecture, I have not added detailed commentary. I plan to write a follow-up post with fuller explanations later.
### Load the libraries
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(ggmap)
### Load the dataset
House <- read_csv("data/kc_house_data.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## id = col_character(),
## date = col_datetime(format = ""),
## price = col_double(),
## bathrooms = col_double(),
## floors = col_double(),
## lat = col_double(),
## long = col_double()
## )
## See spec(...) for full column specifications.
head(House)

| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7129300520 | 2014-10-13 | 221900 | 3 | 1.00 | 1180 | 5650 | 1 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 6414100192 | 2014-12-09 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 5631500400 | 2015-02-25 | 180000 | 2 | 1.00 | 770 | 10000 | 1 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 2487200875 | 2014-12-09 | 604000 | 4 | 3.00 | 1960 | 5000 | 1 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 1954400510 | 2015-02-18 | 510000 | 3 | 2.00 | 1680 | 8080 | 1 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| 7237550310 | 2014-05-12 | 1225000 | 4 | 4.50 | 5420 | 101930 | 1 | 0 | 0 | 3 | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 |
### Initialize a map for King County
kingCounty <- get_map(location = 'issaquah',
zoom = 9,
maptype = "roadmap"
)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=issaquah&zoom=9&size=640x640&scale=2&maptype=roadmap&language=en-EN
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=issaquah
### Generate a heat map
ggmap(kingCounty) +
geom_density2d(data = House, aes(x = long, y = lat), size = .3) +
stat_density2d(data = House, aes(x = long, y = lat, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
scale_fill_gradient(low = "green", high = "red") +
scale_alpha(range = c(0.2, 0.4), guide = FALSE)
House %>%
mutate(
sale_year = year(date),
sale_month = month(date)
) %>%
select(-id, -date) -> House
set.seed(2017)
trainIdx <- sample(1:nrow(House), size = 0.7 * nrow(House))
train <- House[trainIdx, ]
test <- House[-trainIdx, ]
bench_model <- lm(price ~ ., data = train)
summary(bench_model)
##
## Call:
## lm(formula = price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1289272 -98433 -9562 76172 4400147
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.428e+07 1.179e+07 -6.301 3.03e-10 ***
## bedrooms -3.369e+04 2.203e+03 -15.296 < 2e-16 ***
## bathrooms 4.165e+04 3.863e+03 10.782 < 2e-16 ***
## sqft_living 1.444e+02 5.198e+00 27.778 < 2e-16 ***
## sqft_lot 1.202e-01 5.446e-02 2.207 0.027360 *
## floors 5.802e+03 4.248e+03 1.366 0.172069
## waterfront 5.701e+05 2.039e+04 27.952 < 2e-16 ***
## view 5.602e+04 2.482e+03 22.572 < 2e-16 ***
## condition 2.925e+04 2.802e+03 10.439 < 2e-16 ***
## grade 9.494e+04 2.551e+03 37.217 < 2e-16 ***
## sqft_above 2.731e+01 5.154e+00 5.299 1.18e-07 ***
## sqft_basement NA NA NA NA
## yr_built -2.489e+03 8.579e+01 -29.018 < 2e-16 ***
## yr_renovated 2.151e+01 4.304e+00 4.997 5.89e-07 ***
## zipcode -5.576e+02 3.909e+01 -14.265 < 2e-16 ***
## lat 6.133e+05 1.268e+04 48.360 < 2e-16 ***
## long -2.092e+05 1.559e+04 -13.422 < 2e-16 ***
## sqft_living15 2.897e+01 4.061e+00 7.134 1.02e-12 ***
## sqft_lot15 -2.804e-01 8.390e-02 -3.342 0.000833 ***
## sale_year 3.893e+04 5.581e+03 6.976 3.16e-12 ***
## sale_month 1.914e+03 8.338e+02 2.296 0.021713 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 198800 on 15109 degrees of freedom
## Multiple R-squared: 0.7017, Adjusted R-squared: 0.7013
## F-statistic: 1870 on 19 and 15109 DF, p-value: < 2.2e-16
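The note `(1 not defined because of singularities)` and the `NA` row for `sqft_basement` point to an exactly collinear column: in this dataset `sqft_living` appears to decompose exactly into `sqft_above + sqft_basement`, so one of the three is redundant. A quick check (a sketch, assuming `House` is loaded as above):

```r
# If living area equals above-ground plus basement area in every row,
# lm() must drop one of the three columns as linearly dependent.
all(House$sqft_living == House$sqft_above + House$sqft_basement)
```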
benchmark <- predict(bench_model, test)
benchmark <- ifelse(benchmark < 0, 0, benchmark)
### Generate a heat map
ggmap(kingCounty) +
geom_point(data = train, aes(x = long, y = lat, color = log(price), alpha = log(price))) +
scale_color_gradient(low = "green", high = "red")
Visualizing the log of price shows that prices in the north (lat >= 47.5) are higher than in the south (lat < 47.5).
cor_House <- cor(House[, -1])
corrplot(cor_House, order = "hclust")
train %>%
mutate(grade = factor(grade)) %>%
ggplot(aes(x = grade, y = price, fill = grade)) +
geom_boxplot() +
geom_point(
data = train %>%
group_by(grade) %>%
summarise(median = median(price)) %>%
mutate(grade = factor(grade)),
aes(x = grade, y = median, group = 1),
size = 5, stroke = 2,
color = "black", fill = "white", shape = 23
)
Price appears to increase geometrically with each one-step increase in grade. To verify, draw the same boxplots against log(price).
train %>%
mutate(grade = factor(grade)) %>%
ggplot(aes(x = grade, y = log(price), fill = grade)) +
geom_boxplot() +
geom_point(
data = train %>%
group_by(grade) %>%
summarise(median = median(log(price))) %>%
mutate(grade = factor(grade)),
aes(x = grade, y = median, group = 1),
size = 5, stroke = 2,
color = "black", fill = "white", shape = 23
)
train %>%
mutate(yr_cat = cut(yr_built, breaks = seq(1900, 2020, by = 10),
labels = paste0(seq(1900, 2010, by = 10), "s"))) %>%
ggplot(aes(x = yr_cat, y = log(price), fill = yr_cat)) +
geom_boxplot()
train %>%
filter(yr_renovated != 0) %>%
mutate(renovated_cat = cut(yr_renovated, breaks = seq(1930, 2020, by = 10),
labels = paste0(seq(1930, 2010, by = 10), "s"))) %>%
ggplot(aes(x = renovated_cat, y = log(price), fill = renovated_cat)) +
geom_boxplot()
train %>%
mutate(isRenovated = factor(ifelse(yr_renovated != 0, 1, 0))) %>%
ggplot(aes(x = isRenovated, y = log(price), fill = isRenovated)) +
geom_boxplot()
train %>%
mutate(sale_year = factor(sale_year)) %>%
ggplot(aes(x = sale_year, y = log(price), fill = sale_year)) +
geom_boxplot()
train %>%
mutate(sale_month = factor(sale_month)) %>%
ggplot(aes(x = sale_month, y = log(price), fill = sale_month)) +
geom_boxplot()
train %>%
mutate(bathrooms = factor(bathrooms)) %>%
ggplot(aes(x = bathrooms, y = log(price), fill = bathrooms)) +
geom_boxplot()
log(price) and bathrooms show an approximately linear relationship.
train %>%
ggplot(aes(x = lat, y = log(price), color = lat)) +
geom_line() + geom_point(shape = 21)
train %>%
ggplot(aes(x = long, y = log(price), color = long)) +
geom_line() + geom_point(shape = 21)
sort(unique(train$zipcode)) == sort(unique(test$zipcode))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
train %>%
arrange(zipcode) %>%
mutate(zipcode = factor(zipcode)) %>%
ggplot(aes(x = zipcode, y = log(price), fill = zipcode)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, vjust = .1))
splitLat <- function(data){
data <- data %>%
dplyr::mutate(lat1 = ifelse(lat <= 47.5, lat, 0),
lat2 = ifelse(lat > 47.5 & lat <= 47.6, lat, 0),
lat3 = ifelse(lat > 47.6, lat, 0)) %>%
dplyr::select(-lat)
return(data)
}
train <- splitLat(train)
test <- splitLat(test)
train <- train %>%
mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))
test <- test %>%
mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))
train <- train %>%
mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))
test <- test %>%
mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))
train$zipcode <- factor(train$zipcode)
test$zipcode <- factor(test$zipcode)
zipcode_train <- data.frame(model.matrix(price ~ 0 + zipcode, data = train))
zipcode_test <- data.frame(model.matrix(price ~ 0 + zipcode, data = test))
train <- train %>%
select(-zipcode) %>%
cbind(zipcode_train)
test <- test %>%
select(-zipcode) %>%
cbind(zipcode_test)
model <- lm(log(price) ~ ., data = train)
summary(model)
##
## Call:
## lm(formula = log(price) ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34466 -0.09549 0.00900 0.10445 0.99991
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.718e+02 1.348e+01 -12.746 < 2e-16 ***
## bedrooms 3.475e-03 2.059e-03 1.687 0.091539 .
## bathrooms 3.492e-02 3.607e-03 9.682 < 2e-16 ***
## sqft_living 1.293e-04 4.842e-06 26.706 < 2e-16 ***
## sqft_lot 5.831e-07 5.052e-08 11.542 < 2e-16 ***
## floors -2.629e-02 4.304e-03 -6.107 1.04e-09 ***
## waterfront 4.803e-01 1.920e-02 25.022 < 2e-16 ***
## view 5.911e-02 2.351e-03 25.144 < 2e-16 ***
## condition 6.463e-02 2.662e-03 24.278 < 2e-16 ***
## grade 8.787e-02 2.489e-03 35.305 < 2e-16 ***
## sqft_above 7.052e-05 4.960e-06 14.218 < 2e-16 ***
## sqft_basement NA NA NA NA
## yr_built 1.412e-05 3.319e-04 0.043 0.966069
## yr_renovated 3.900e-03 5.273e-04 7.396 1.48e-13 ***
## long -2.737e-01 6.348e-02 -4.312 1.63e-05 ***
## sqft_living15 8.920e-05 3.933e-06 22.679 < 2e-16 ***
## sqft_lot15 1.997e-07 7.990e-08 2.500 0.012442 *
## sale_year 6.783e-02 5.156e-03 13.156 < 2e-16 ***
## sale_month 3.004e-03 7.699e-04 3.902 9.58e-05 ***
## lat1 2.801e-01 9.154e-02 3.060 0.002215 **
## lat2 2.825e-01 9.148e-02 3.088 0.002016 **
## lat3 2.829e-01 9.142e-02 3.094 0.001978 **
## isRenovated -7.677e+00 1.046e+00 -7.339 2.26e-13 ***
## age 2.176e-04 3.375e-04 0.645 0.518993
## zipcode98001 -6.004e-01 3.569e-02 -16.821 < 2e-16 ***
## zipcode98002 -6.168e-01 3.805e-02 -16.209 < 2e-16 ***
## zipcode98003 -5.980e-01 3.546e-02 -16.866 < 2e-16 ***
## zipcode98004 2.897e-01 2.206e-02 13.137 < 2e-16 ***
## zipcode98005 -4.895e-02 2.638e-02 -1.855 0.063556 .
## zipcode98006 -1.367e-01 2.759e-02 -4.955 7.29e-07 ***
## zipcode98007 -1.257e-01 2.798e-02 -4.491 7.13e-06 ***
## zipcode98008 -1.270e-01 2.610e-02 -4.867 1.14e-06 ***
## zipcode98010 -2.843e-01 4.438e-02 -6.407 1.53e-10 ***
## zipcode98011 -3.848e-01 2.631e-02 -14.625 < 2e-16 ***
## zipcode98014 -3.790e-01 4.221e-02 -8.978 < 2e-16 ***
## zipcode98019 -4.396e-01 3.590e-02 -12.246 < 2e-16 ***
## zipcode98022 -4.537e-01 4.833e-02 -9.388 < 2e-16 ***
## zipcode98023 -6.622e-01 3.506e-02 -18.890 < 2e-16 ***
## zipcode98024 -2.732e-01 4.449e-02 -6.141 8.39e-10 ***
## zipcode98027 -1.881e-01 3.233e-02 -5.816 6.15e-09 ***
## zipcode98028 -4.212e-01 2.326e-02 -18.107 < 2e-16 ***
## zipcode98029 -1.141e-01 3.418e-02 -3.339 0.000842 ***
## zipcode98030 -5.423e-01 3.469e-02 -15.632 < 2e-16 ***
## zipcode98031 -5.332e-01 3.254e-02 -16.388 < 2e-16 ***
## zipcode98032 -6.520e-01 3.605e-02 -18.085 < 2e-16 ***
## zipcode98033 -3.718e-02 2.174e-02 -1.710 0.087260 .
## zipcode98034 -2.914e-01 2.131e-02 -13.677 < 2e-16 ***
## zipcode98038 -3.837e-01 3.739e-02 -10.261 < 2e-16 ***
## zipcode98039 4.156e-01 3.461e-02 12.007 < 2e-16 ***
## zipcode98040 7.663e-02 2.651e-02 2.890 0.003853 **
## zipcode98042 -5.178e-01 3.494e-02 -14.820 < 2e-16 ***
## zipcode98045 -1.802e-01 4.815e-02 -3.743 0.000183 ***
## zipcode98052 -1.569e-01 2.412e-02 -6.505 8.04e-11 ***
## zipcode98053 -1.789e-01 2.956e-02 -6.051 1.47e-09 ***
## zipcode98055 -4.772e-01 3.013e-02 -15.838 < 2e-16 ***
## zipcode98056 -4.080e-01 2.753e-02 -14.821 < 2e-16 ***
## zipcode98058 -4.548e-01 3.093e-02 -14.703 < 2e-16 ***
## zipcode98059 -3.240e-01 2.930e-02 -11.056 < 2e-16 ***
## zipcode98065 -2.798e-01 4.155e-02 -6.734 1.71e-11 ***
## zipcode98070 -3.816e-01 3.460e-02 -11.029 < 2e-16 ***
## zipcode98072 -3.324e-01 2.741e-02 -12.128 < 2e-16 ***
## zipcode98074 -2.089e-01 2.808e-02 -7.440 1.06e-13 ***
## zipcode98075 -1.880e-01 3.288e-02 -5.718 1.10e-08 ***
## zipcode98077 -3.618e-01 3.144e-02 -11.505 < 2e-16 ***
## zipcode98092 -5.397e-01 3.801e-02 -14.198 < 2e-16 ***
## zipcode98102 1.036e-01 2.557e-02 4.053 5.09e-05 ***
## zipcode98103 -4.685e-02 1.643e-02 -2.851 0.004358 **
## zipcode98105 8.994e-02 2.070e-02 4.345 1.40e-05 ***
## zipcode98106 -4.843e-01 2.412e-02 -20.077 < 2e-16 ***
## zipcode98107 -3.243e-02 1.897e-02 -1.709 0.087475 .
## zipcode98108 -4.541e-01 2.633e-02 -17.245 < 2e-16 ***
## zipcode98109 1.567e-01 2.431e-02 6.446 1.18e-10 ***
## zipcode98112 2.177e-01 1.997e-02 10.899 < 2e-16 ***
## zipcode98115 -3.640e-02 1.718e-02 -2.119 0.034114 *
## zipcode98116 -6.915e-02 2.326e-02 -2.972 0.002959 **
## zipcode98117 -6.207e-02 1.637e-02 -3.792 0.000150 ***
## zipcode98118 -3.440e-01 2.411e-02 -14.270 < 2e-16 ***
## zipcode98119 1.248e-01 2.083e-02 5.992 2.12e-09 ***
## zipcode98122 -3.056e-02 1.935e-02 -1.579 0.114352
## zipcode98125 -2.821e-01 1.917e-02 -14.713 < 2e-16 ***
## zipcode98126 -2.725e-01 2.384e-02 -11.433 < 2e-16 ***
## zipcode98133 -4.175e-01 1.852e-02 -22.543 < 2e-16 ***
## zipcode98136 -1.397e-01 2.514e-02 -5.556 2.80e-08 ***
## zipcode98144 -1.549e-01 2.358e-02 -6.571 5.16e-11 ***
## zipcode98146 -4.872e-01 2.546e-02 -19.133 < 2e-16 ***
## zipcode98148 -4.666e-01 3.982e-02 -11.716 < 2e-16 ***
## zipcode98155 -4.416e-01 2.033e-02 -21.714 < 2e-16 ***
## zipcode98166 -3.704e-01 2.870e-02 -12.903 < 2e-16 ***
## zipcode98168 -6.160e-01 2.705e-02 -22.773 < 2e-16 ***
## zipcode98177 -2.814e-01 2.055e-02 -13.696 < 2e-16 ***
## zipcode98178 -5.611e-01 2.746e-02 -20.433 < 2e-16 ***
## zipcode98188 -5.367e-01 3.216e-02 -16.688 < 2e-16 ***
## zipcode98198 -5.939e-01 3.122e-02 -19.024 < 2e-16 ***
## zipcode98199 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1832 on 15037 degrees of freedom
## Multiple R-squared: 0.8798, Adjusted R-squared: 0.8791
## F-statistic: 1210 on 91 and 15037 DF, p-value: < 2.2e-16
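Here `zipcode98199` gets an `NA` coefficient for a related reason: the full set of zipcode dummies sums to 1 in every row, so together with the intercept they are linearly dependent (the dummy-variable trap), and `lm` drops one level automatically. A quick sanity check (a sketch, using the `zipcode_train` dummies built above):

```r
# Each row of the one-hot zipcode matrix contains exactly one 1,
# so the dummy columns plus the intercept are collinear.
all(rowSums(zipcode_train) == 1)
```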
We use Root Mean Squared Logarithmic Error (RMSLE) and Mean Squared Error (MSE) as evaluation metrics. RMSLE penalizes under-estimates more heavily than over-estimates.
\[ RMSLE = \sqrt{\frac{1}{n} \sum^n_{i=1} \left( \log(p_i + 1) - \log(a_i + 1)\right)^2} \]
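The asymmetry follows from the formula: equal absolute errors produce a larger squared log error when the prediction falls below the actual value. A small illustration with made-up numbers:

```r
# Squared log error of a single prediction p against actual value a
sq_log_err <- function(p, a) (log(p + 1) - log(a + 1))^2

# Same absolute error of 50000 on a 200000 house:
sq_log_err(150000, 200000)  # under-prediction
sq_log_err(250000, 200000)  # over-prediction: smaller penalty
sq_log_err(150000, 200000) > sq_log_err(250000, 200000)  # TRUE
```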
rmsle <- function(predict, actual){
if(length(predict) != length(actual))
stop("The lengths of the two vectors differ.")
len <- length(predict)
rmsle <- sqrt((1/len) * sum((log(predict + 1) - log(actual + 1))^2))
return(rmsle)
}
The second metric, labeled MSE here, is as follows (note the square root: as defined and implemented, it is actually the root mean squared error, RMSE):
\[ MSE = \sqrt{\frac{1}{n} \sum^n_{i=1} (p_i - a_i)^2} \]
mse <- function(predict, actual){
if(length(predict) != length(actual))
stop("The lengths of the two vectors differ.")
len <- length(predict)
mse <- sqrt((1/len) * sum((predict - actual)^2))
return(mse)
}
pred <- predict(model, test)
pred <- exp(pred)
result.rmsle <- rmsle(pred, test$price)
benchmark.rmsle <- rmsle(benchmark, test$price)
cat("RMSLE (Benchmark): ", benchmark.rmsle, "\nRMSLE (Final): ", result.rmsle)
## RMSLE (Benchmark): 0.9280684
## RMSLE (Final): 0.1794722
result.mse <- mse(pred, test$price)
benchmark.mse <- mse(benchmark, test$price)
cat("MSE (Benchmark):", benchmark.mse, "\nMSE (Final):", result.mse)
## MSE (Benchmark): 204989.7
## MSE (Final): 193444.6
library(glmnet)
set.seed(2017)
lambda <- exp(-seq(7, 8, length.out = 400))
ridge.cv <- cv.glmnet(
x = as.matrix(train[, -1]),
y = log(train$price),
alpha = 0,
lambda = lambda,
parallel = TRUE
)
ridge.pred <- predict(ridge.cv, as.matrix(test[, -1]), s = ridge.cv$lambda.min)
ridge.pred <- as.vector(exp(ridge.pred))
ridge.rmsle <- rmsle(ridge.pred, test$price)
cat("RMSLE (Ridge):", ridge.rmsle)
## RMSLE (Ridge): 0.1797213
ridge.mse <- mse(ridge.pred, test$price)
cat("MSE (Ridge):", ridge.mse)
## MSE (Ridge): 192229.2
set.seed(2017)
lambda <- exp(-seq(10, 11, length.out = 400))
lasso.cv <- cv.glmnet(
x = as.matrix(train[, -1]),
y = log(train$price),
alpha = 1,
lambda = lambda,
parallel = TRUE
)
lasso.pred <- predict(lasso.cv, as.matrix(test[, -1]), s = lasso.cv$lambda.min)
lasso.pred <- as.vector(exp(lasso.pred))
lasso.rmsle <- rmsle(lasso.pred, test$price)
cat("RMSLE (Lasso):", lasso.rmsle)
## RMSLE (Lasso): 0.1797293
lasso.mse <- mse(lasso.pred, test$price)
cat("MSE (Lasso):", lasso.mse)
## MSE (Lasso): 192438.7
# Adaptive Lasso Function with Automatic 10-fold CV
adaLasso <- function(data, labels, parallel = TRUE, standardize = TRUE,
                     weight, gamma = 1, formula = NULL, ols.data = NULL,
                     ridge.lambda = NULL, lasso.lambda = NULL, seed = 1){
require(glmnet)
if(!(weight %in% c("ols", "ridge"))){
stop("The parameter 'weight' must be either 'ols' or 'ridge'.")
}
if(weight == "ols"){
if(is.null(ols.data)){
stop("If you want to use the coefficients of OLS as the weight for Adaptive Lasso, you have to put a data.frame to ols.data argument.")
}
ols <- lm(formula = formula, data = ols.data)
weight <- 1/abs(as.matrix(coefficients(ols)[-1]))^gamma
} else if(weight == "ridge"){
set.seed(seed)
if(parallel)
doMC::registerDoMC(cores = 4)
cv.ridge <- cv.glmnet(x = data, y = labels, alpha = 0, parallel = parallel, standardize = standardize, lambda = ridge.lambda)
weight <- 1/abs(matrix(coef(cv.ridge, s = cv.ridge$lambda.min)[-1, ]))^gamma
}
weight[,1][weight[, 1] == Inf] <- 999999999
set.seed(seed)
if(parallel)
doMC::registerDoMC(cores = 4)
cv.lasso <- cv.glmnet(x = data, y = labels, alpha = 1, parallel = parallel, standardize = standardize, lambda = lasso.lambda, penalty.factor = weight)
cv.lasso
}
set.seed(2017)
ridge.lambda <- exp(-seq(7, 8, length.out = 400))
lasso.lambda <- exp(-seq(9, 10, length.out = 400))
adaLasso.cv <- adaLasso(
data = as.matrix(train[, -1]),
labels = log(train$price),
weight = "ridge",
ridge.lambda = ridge.lambda,
lasso.lambda = lasso.lambda,
parallel = TRUE
)
adaLasso.pred <- predict(adaLasso.cv, as.matrix(test[, -1]), s = adaLasso.cv$lambda.min)
adaLasso.pred <- as.vector(exp(adaLasso.pred))
adaLasso.rmsle <- rmsle(adaLasso.pred, test$price)
cat("RMSLE (adaLasso):", adaLasso.rmsle)
## RMSLE (adaLasso): 0.1796296
adaLasso.mse <- mse(adaLasso.pred, test$price)
cat("MSE (adaLasso):", adaLasso.mse)
## MSE (adaLasso): 190009.3
library(reshape2)
reg.result <- data.frame(Method = c("Least Square", "Ridge\nRegression", "Lasso", "Adaptive Lasso"),
RMSLE = c(result.rmsle, ridge.rmsle, lasso.rmsle, adaLasso.rmsle),
MSE = c(result.mse, ridge.mse, lasso.mse, adaLasso.mse))
reg.result <- melt(reg.result, id.vars = "Method",
variable.name = "Metric",
value.name = "Score")
reg.result$Method <- factor(reg.result$Method,
levels = c("Least Square", "Lasso", "Ridge\nRegression", "Adaptive Lasso"))
ggplot(reg.result, aes(x = Method, y = Score, group = Metric)) +
geom_line() + geom_point(aes(color = Method, shape = Method), size = 5) +
facet_grid(Metric ~ ., scales = "free_y") +
geom_text(aes(label = ifelse(Score < 1, round(Score, 7), round(Score, 1))),
size = 3, hjust = 1.3, vjust = 1.3, fontface = 'bold')
library(ranger)
set.seed(2017)
rf <- ranger(
formula = log(price) ~ .,
data = train,
num.trees = 2000,
importance = 'impurity',
write.forest = TRUE
)
## Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
## Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 39 seconds.
## Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 9 seconds.
## Growing trees.. Progress: 77%. Estimated remaining time: 37 seconds.
rf.pred <- predict(rf, test)
rf.pred <- exp(rf.pred$predictions)
rf.rmsle <- rmsle(rf.pred, test$price)
cat("RMSLE (Random Forest):", rf.rmsle)
## RMSLE (Random Forest): 0.1775066
rf.mse <- mse(rf.pred, test$price)
cat("MSE (Random Forest):", rf.mse)
## MSE (Random Forest): 155200.4
library(xgboost)
params <- list(
eta = 0.3,
gamma = 0,
max_depth = 5,
min_child_weight = 1,
subsample = 1,
colsample_bytree = 1,
objective = "reg:linear",
eval_metric = "rmse"
)
set.seed(2017)
xgb.cv <- xgb.cv(
params = params,
data = as.matrix(train[, -1]),
nrounds = 200,
nfold = 10,
label = log(train$price),
verbose = 1,
print_every_n = 25
)
## [1] train-rmse:8.797288+0.000701 test-rmse:8.797174+0.008145
## [26] train-rmse:0.163139+0.001116 test-rmse:0.180500+0.006925
## [51] train-rmse:0.142965+0.001086 test-rmse:0.171362+0.006885
## [76] train-rmse:0.131548+0.000993 test-rmse:0.168632+0.006659
## [101] train-rmse:0.122853+0.001172 test-rmse:0.167706+0.006633
## [126] train-rmse:0.115637+0.001038 test-rmse:0.166971+0.006473
## [151] train-rmse:0.109521+0.000955 test-rmse:0.166733+0.006552
## [176] train-rmse:0.104160+0.001013 test-rmse:0.166646+0.006674
## [200] train-rmse:0.099301+0.000813 test-rmse:0.166665+0.006460
best.xgb <- xgb.cv$evaluation_log %>%
arrange(test_rmse_mean, test_rmse_std) %>%
head(1)
best.xgb

| iter | train_rmse_mean | train_rmse_std | test_rmse_mean | test_rmse_std |
|---|---|---|---|---|
| 161 | 0.1073536 | 0.0011955 | 0.1665503 | 0.0066255 |
iter <- best.xgb$iter
xgb <- xgboost(
  params = params,
data = as.matrix(train[, -1]),
nrounds = iter,
label = log(train$price),
verbose = 1,
print_every_n = 25
)
## [1] train-rmse:8.797178
## [26] train-rmse:0.151631
## [51] train-rmse:0.130160
## [76] train-rmse:0.116726
## [101] train-rmse:0.107669
## [126] train-rmse:0.098944
## [151] train-rmse:0.092342
## [161] train-rmse:0.090079
xgb.pred <- predict(xgb, as.matrix(test[, -1]))
xgb.pred <- exp(xgb.pred)
xgb.rmsle <- rmsle(xgb.pred, test$price)
cat("RMSLE (Xgboost):", xgb.rmsle)
## RMSLE (Xgboost): 0.1654061
xgb.mse <- mse(xgb.pred, test$price)
cat("MSE (Xgboost):", xgb.mse)
## MSE (Xgboost): 124287.6
library(caret)
result.r2 <- postResample(pred, test$price)[2]
ridge.r2 <- postResample(ridge.pred, test$price)[2]
lasso.r2 <- postResample(lasso.pred, test$price)[2]
adaLasso.r2 <- postResample(adaLasso.pred, test$price)[2]
rf.r2 <- postResample(rf.pred, test$price)[2]
xgb.r2 <- postResample(xgb.pred, test$price)[2]
final.result <- data.frame(Method = c("Least Square", "Ridge\nRegression", "Lasso", "Adaptive\nLasso", "Random\nForest", "XGBoost"),
MSE = c(result.mse, ridge.mse, lasso.mse, adaLasso.mse, rf.mse, xgb.mse),
RMSLE = c(result.rmsle, ridge.rmsle, lasso.rmsle, adaLasso.rmsle, rf.rmsle, xgb.rmsle),
R2 = c(result.r2, ridge.r2, lasso.r2, adaLasso.r2, rf.r2, xgb.r2))
final.result <- melt(final.result, id.vars = "Method",
variable.name = "Metric",
value.name = "Score")
final.result$Method <- factor(final.result$Method,
levels = c("Least Square", "Lasso", "Ridge\nRegression", "Adaptive\nLasso", "Random\nForest", "XGBoost"))
ggplot(final.result, aes(x = Method, y = Score, group = Metric)) +
geom_line(linetype = "dashed") + geom_point(aes(color = Method, shape = Method), size = 5) +
facet_grid(Metric ~ ., scales = "free_y") +
geom_text(aes(label = ifelse(Score < 1, round(Score, 7), round(Score, 1))),
size = 2.5, hjust = 1.2, vjust = 1.4, fontface = 'bold.italic') +
theme(legend.key.size = unit(2, 'lines'))
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif <- vif((lm(log(price) ~ bedrooms + bathrooms + sqft_living + sqft_lot +
floors + waterfront + view + condition + grade + sqft_above +
long + sqft_living15 + sqft_lot15 + lat1 + lat2 + lat3 +
isRenovated + age + zipcode98001 + zipcode98002 +
zipcode98003 + zipcode98004 + zipcode98005 + zipcode98006 +
zipcode98007 + zipcode98008 + zipcode98010 + zipcode98011 +
zipcode98014 + zipcode98019 + zipcode98022 + zipcode98023 +
zipcode98024 + zipcode98027 + zipcode98028 + zipcode98029 +
zipcode98030 + zipcode98031 + zipcode98032 + zipcode98033 +
zipcode98034 + zipcode98038 + zipcode98039 + zipcode98040 +
zipcode98042 + zipcode98045 + zipcode98052 + zipcode98053 +
zipcode98055 + zipcode98056 + zipcode98058 + zipcode98059 +
zipcode98065 + zipcode98070 + zipcode98072 + zipcode98074 +
zipcode98075 + zipcode98077 + zipcode98092 + zipcode98102 +
zipcode98103 + zipcode98105 + zipcode98106 + zipcode98107 +
zipcode98108 + zipcode98109 + zipcode98112 + zipcode98115 +
zipcode98116 + zipcode98117 + zipcode98118 + zipcode98119 +
zipcode98122 + zipcode98125 + zipcode98126 + zipcode98133 +
zipcode98136 + zipcode98144 + zipcode98146 + zipcode98148 +
zipcode98155 + zipcode98166 + zipcode98168 + zipcode98177 +
zipcode98178 + zipcode98188 + zipcode98198, data = train)))
names(vif[vif > 10])
## [1] "long"         "lat1"         "lat2"         "lat3"
## [5] "zipcode98022" "zipcode98023" "zipcode98038" "zipcode98042"
## [9] "zipcode98045" "zipcode98065" "zipcode98092"